Predicting the MLB playoff teams¶

Myunghyun Pyo(48954443)¶

Stat 6309¶

Introduction¶

These days, most of the famous sports in the United States such as Football(NFL), Basketball(NBA), Baseball(MLB), and Hockey(NHL) try to gather data as much as they can to perform better and beat the opponents. Among those sports, baseball(MLB) has the biggest data and they started analyzing those data using Sabermetrics. Compare to other sports, baseball is more known for information sport. Because, in other sports, players bring the ball with them so it is hard to track what players do in that short time, but for baseball, when the pitchers throw the ball no other distraction comes to that situation. Baseball games are more team sport than other sports, so it cannot be decided by one or two superstars to decide the outcome of the winning or losing[1]. Therefore, in baseball, it is easier to predict which team will win or lose compare to other sports. So in baseball, it is hard for underdog to beat overdog in real game.

There are three different types of teams stats, which are hitting stats, pitching stats and fielding stats. Hitting and pithcing stats are the standard stats that we can easily find while watching games. The batting stats include which team hits more and scores more, and the pitching stats include which team hits more and scores more to the opponents. Fielding stats are while in the defense innings, how many double plays or errors player made. Advanced stats that are used in MLB are BABIP(Batting Average on Balls in Play), DER(Defensive Efficiency Rating), ISO(teams' raw power), etc[2]. Few years ago, for the batting stat, MLB uses only simple batting average, for the pitching stat, they use simple ERA(Earned Run Average). But nowadays, as I mentioned above, MLB has few more advanced stats that are more useful to evaluate palyers' competence, also evaluate total teams' hitting, ptiching and fielding rank. With using those advanced stats, it is going to easier and accuracy to predict which teams are making a playoff with the regular season records.

From that idea, in this project, I would like to predict which teams make playoff using the given data and compare with the actual teams who make playoff. To find out which teams will make playoff, I need to figure out which variables on hitting or pitching data have the importance on the modeling. Also, if needed, find out more advanced variables that are used in the real baseball game. As MLB is still growing business, from this modeling, it helps fans, team general managers and sponsor companies to follow up the games. Thus, the goal of this project was to use classification to predict playoff through Logistic Regression, K-Nearest Neighbors(KNN), Decision Tree, Random Forest Classifier, XGBoost, Support Vector Machine given a input features relating to the batting, pitching and fielding(defense).

Data Description¶

The "Teams" dataset contains a total of 52 columns and it team make playoff by winning either division or wild card spot. First few columns show which teams, what division they included, either making playoff or not, and ball park name. Then, 4 columns show how many games each team played and wins/loss by teams. After that, I could divide columns by 3 categories, batting, pitching and fielding. In those columns, it includes each teams' average batting, pitching and fielding stats. To figure out which teams make playoff, column name DivWin and WCWin should be 'Y'. When teams win their division they automatically go to the playoff, and teams who win wilcard game they could also get playoff ticket. Therefore, using those two columns make a new column when team satisfy either coditions.

In [1]:
%pip install seaborn
import math
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("default")
sns.set(font_scale=1)
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import warnings
warnings.simplefilter(action = 'ignore',category = FutureWarning)
Requirement already satisfied: seaborn in c:\users\roymy\anaconda3\lib\site-packages (0.12.2)
Requirement already satisfied: numpy!=1.24.0,>=1.17 in c:\users\roymy\anaconda3\lib\site-packages (from seaborn) (1.24.3)
Requirement already satisfied: pandas>=0.25 in c:\users\roymy\anaconda3\lib\site-packages (from seaborn) (1.5.3)
Requirement already satisfied: matplotlib!=3.6.1,>=3.1 in c:\users\roymy\anaconda3\lib\site-packages (from seaborn) (3.7.1)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\roymy\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.0.5)
Requirement already satisfied: cycler>=0.10 in c:\users\roymy\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\roymy\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (4.25.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\roymy\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.4.4)
Requirement already satisfied: packaging>=20.0 in c:\users\roymy\appdata\roaming\python\python311\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (23.1)
Requirement already satisfied: pillow>=6.2.0 in c:\users\roymy\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (9.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\roymy\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\roymy\appdata\roaming\python\python311\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\roymy\anaconda3\lib\site-packages (from pandas>=0.25->seaborn) (2022.7)
Requirement already satisfied: six>=1.5 in c:\users\roymy\appdata\roaming\python\python311\site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.1->seaborn) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
In [2]:
#Import the dataset
df = pd.read_csv("Teams.csv")
df.tail(5)
Out[2]:
yearID lgID teamID franchID divID Rank G Ghome W L ... DP FP name park attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro
2980 2021 NL SLN STL C 2 162 81.0 90 72 ... 137 0.986 St. Louis Cardinals Busch Stadium III 2102530.0 92 92 STL SLN SLN
2981 2021 AL TBA TBD E 1 162 81.0 100 62 ... 130 0.986 Tampa Bay Rays Tropicana Field 761072.0 92 91 TBR TBA TBA
2982 2021 AL TEX TEX W 5 162 81.0 60 102 ... 146 0.986 Texas Rangers Globe Life Field 2110258.0 99 101 TEX TEX TEX
2983 2021 AL TOR TOR E 4 162 80.0 91 71 ... 122 0.984 Toronto Blue Jays Sahlen Field 805901.0 102 101 TOR TOR TOR
2984 2021 NL WAS WSN E 5 162 81.0 65 97 ... 116 0.983 Washington Nationals Nationals Park 1465543.0 95 96 WSN MON WAS

5 rows × 48 columns

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2985 entries, 0 to 2984
Data columns (total 48 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   yearID          2985 non-null   int64  
 1   lgID            2935 non-null   object 
 2   teamID          2985 non-null   object 
 3   franchID        2985 non-null   object 
 4   divID           1468 non-null   object 
 5   Rank            2985 non-null   int64  
 6   G               2985 non-null   int64  
 7   Ghome           2586 non-null   float64
 8   W               2985 non-null   int64  
 9   L               2985 non-null   int64  
 10  DivWin          1440 non-null   object 
 11  WCWin           804 non-null    object 
 12  LgWin           2957 non-null   object 
 13  WSWin           2628 non-null   object 
 14  R               2985 non-null   int64  
 15  AB              2985 non-null   int64  
 16  H               2985 non-null   int64  
 17  2B              2985 non-null   int64  
 18  3B              2985 non-null   int64  
 19  HR              2985 non-null   int64  
 20  BB              2984 non-null   float64
 21  SO              2969 non-null   float64
 22  SB              2859 non-null   float64
 23  CS              2153 non-null   float64
 24  HBP             1827 non-null   float64
 25  SF              1444 non-null   float64
 26  RA              2985 non-null   int64  
 27  ER              2985 non-null   int64  
 28  ERA             2985 non-null   float64
 29  CG              2985 non-null   int64  
 30  SHO             2985 non-null   int64  
 31  SV              2985 non-null   int64  
 32  IPouts          2985 non-null   int64  
 33  HA              2985 non-null   int64  
 34  HRA             2985 non-null   int64  
 35  BBA             2985 non-null   int64  
 36  SOA             2985 non-null   int64  
 37  E               2985 non-null   int64  
 38  DP              2985 non-null   int64  
 39  FP              2985 non-null   float64
 40  name            2985 non-null   object 
 41  park            2951 non-null   object 
 42  attendance      2706 non-null   float64
 43  BPF             2985 non-null   int64  
 44  PPF             2985 non-null   int64  
 45  teamIDBR        2985 non-null   object 
 46  teamIDlahman45  2985 non-null   object 
 47  teamIDretro     2985 non-null   object 
dtypes: float64(10), int64(25), object(13)
memory usage: 1.1+ MB

Data Preprocessing¶

Before beginning the data preprocessing, I imported the proper packages and loaded in the datasets. As dataset includes the year from 1871, few teams have changed their locations or disappeared from the league. Also, for the wildcard series has been established since 2012, there are few missing values for the WCWin. As this data set is about sport, except the team names, DivWin, WCWin, LgWin, and WSWin, most of the columns are numeric type. To check how big the data set is there are 143280 observations on the data set and there are only 7.02%, 10064 missing values. And to check which teams were existed since 1871, I have checked the team names.

In [4]:
#Check for missing data
print("The total number of data: ", df.shape[0]*df.shape[1])
print("The total number of null values: {} and it occupies {:.2f}% of the toal  ".format(df.isnull().sum().sum(), (df.isnull().sum().sum()*100)/(df.shape[0]*df.shape[1])))
print("The number of teams: ", df['franchID'].unique())
The total number of data:  143280
The total number of null values: 10064 and it occupies 7.02% of the toal  
The number of teams:  ['BNA' 'CNA' 'CFC' 'KEK' 'NNA' 'PNA' 'ROK' 'TRO' 'OLY' 'BLC' 'ECK' 'BRA'
 'MAN' 'NAT' 'MAR' 'RES' 'PWS' 'WBL' 'HNA' 'WES' 'NHV' 'CEN' 'SLR' 'SNA'
 'WNT' 'ATL' 'CHC' 'CNR' 'HAR' 'LGR' 'NYU' 'ATH' 'SBS' 'IBL' 'MLG' 'PRO'
 'BUF' 'CBL' 'SYR' 'TRT' 'WOR' 'DTN' 'BLO' 'CIN' 'LOU' 'PHA' 'PIT' 'STL'
 'CBK' 'SFG' 'NYP' 'PHI' 'ALT' 'BLU' 'LAD' 'BRD' 'CPI' 'COR' 'IHO' 'KCU'
 'MLU' 'PHK' 'RIC' 'SLM' 'STP' 'TOL' 'WIL' 'WST' 'WNA' 'KCN' 'WNL' 'CLV'
 'IND' 'KCC' 'CLS' 'BFB' 'BRG' 'BWW' 'BRS' 'CHP' 'CLI' 'NYI' 'PHQ' 'PBB'
 'ROC' 'SYS' 'TLM' 'CKK' 'MLA' 'WAS' 'NYY' 'BOS' 'CHW' 'CLE' 'DET' 'BAL'
 'OAK' 'MIN' 'BLT' 'BTT' 'BFL' 'CHH' 'NEW' 'KCP' 'PBS' 'SLI' 'ANA' 'TEX'
 'HOU' 'NYM' 'KCR' 'WSN' 'SDP' 'MIL' 'SEA' 'TOR' 'COL' 'FLA' 'ARI' 'TBD']

Most of teams that exist in these days have been pretty much fixed since 1990. Therefore, I would use the data set since 1990. In 1994, there was a players' strike in that season, therefore, there were no playoff and in 2020, as it was a Covid season, it has short season so those two seasons could make a noise for this data set. By that reason, I removed those two seasons.

In [5]:
df = df[df['yearID'] >= 1990] # Select recent 30 years seasons.
df = df[df['yearID'] != 1994] # No playoff season caused by players' STRIKE
df = df[df['yearID'] != 2020] # Short season 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
df.head(10)
Out[5]:
yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin WSWin R AB H 2B 3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV IPouts HA HRA BBA SOA E DP FP name park attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro
2047 1990 NL ATL ATL W 6 162 81.0 65 97 N NaN N N 682 5504 1376 263 26 162 473.0 1010.0 92.0 55.0 27.0 31.0 821 727 4.58 17 8 30 4289 1527 128 579 938 158 133 0.974 Atlanta Braves Atlanta-Fulton County Stadium 980129.0 105 106 ATL ATL ATL
2048 1990 AL BAL BAL E 5 161 80.0 76 85 N NaN N N 669 5410 1328 234 22 132 660.0 962.0 94.0 52.0 40.0 41.0 698 644 4.04 10 5 43 4306 1445 161 537 776 93 151 0.985 Baltimore Orioles Memorial Stadium 2415189.0 97 98 BAL BAL BAL
2049 1990 AL BOS BOS E 1 162 81.0 88 74 Y NaN N N 699 5516 1502 298 31 106 598.0 795.0 53.0 52.0 28.0 44.0 664 596 3.72 15 13 44 4326 1439 92 519 997 123 154 0.980 Boston Red Sox Fenway Park II 2528986.0 105 105 BOS BOS BOS
2050 1990 AL CAL ANA W 4 162 81.0 80 82 N NaN N N 690 5570 1448 237 27 147 566.0 1000.0 69.0 43.0 28.0 45.0 706 613 3.79 21 13 42 4362 1482 106 544 944 142 186 0.978 California Angels Anaheim Stadium 2555688.0 97 97 CAL CAL CAL
2051 1990 AL CHA CHW W 2 162 80.0 94 68 N NaN N N 682 5402 1393 251 44 106 478.0 903.0 140.0 90.0 36.0 47.0 633 581 3.61 17 10 68 4348 1313 106 548 914 124 169 0.980 Chicago White Sox Comiskey Park 2002357.0 98 98 CHW CHA CHA
2052 1990 NL CHN CHC E 4 162 81.0 77 85 N NaN N N 690 5600 1474 240 36 136 406.0 869.0 151.0 50.0 30.0 51.0 774 695 4.34 13 7 42 4328 1510 121 572 877 124 136 0.980 Chicago Cubs Wrigley Field 2243791.0 108 108 CHC CHN CHN
2053 1990 NL CIN CIN W 1 162 81.0 91 71 Y NaN Y Y 693 5525 1466 284 40 125 466.0 913.0 166.0 66.0 42.0 42.0 597 549 3.39 14 12 50 4369 1338 124 543 1029 102 126 0.983 Cincinnati Reds Riverfront Stadium 2400892.0 105 105 CIN CIN CIN
2054 1990 AL CLE CLE E 4 162 81.0 77 85 N NaN N N 732 5485 1465 266 41 110 458.0 836.0 107.0 52.0 29.0 61.0 737 676 4.26 12 10 47 4282 1491 163 518 860 117 146 0.981 Cleveland Indians Cleveland Stadium 1225240.0 100 100 CLE CLE CLE
2055 1990 AL DET DET E 3 162 81.0 79 83 N NaN N N 750 5479 1418 241 32 172 634.0 952.0 82.0 57.0 34.0 41.0 754 697 4.39 15 12 45 4291 1401 154 661 856 131 178 0.979 Detroit Tigers Tiger Stadium 1495785.0 101 102 DET DET DET
2056 1990 NL HOU HOU W 4 162 81.0 75 87 N NaN N N 573 5379 1301 209 32 94 548.0 997.0 179.0 83.0 28.0 41.0 656 581 3.61 12 6 37 4350 1396 130 496 854 131 124 0.978 Houston Astros Astrodome 1310927.0 97 98 HOU HOU HOU
In [6]:
#reset the index.
df = df.reset_index()
df = df.drop(["index"], axis=1)
pd.set_option('display.max_columns', None)
df.head(5)
df.isnull().sum()
Out[6]:
yearID              0
lgID                0
teamID              0
franchID            0
divID               0
Rank                0
G                   0
Ghome               0
W                   0
L                   0
DivWin              0
WCWin             106
LgWin               0
WSWin               0
R                   0
AB                  0
H                   0
2B                  0
3B                  0
HR                  0
BB                  0
SO                  0
SB                  0
CS                  0
HBP                 0
SF                  0
RA                  0
ER                  0
ERA                 0
CG                  0
SHO                 0
SV                  0
IPouts              0
HA                  0
HRA                 0
BBA                 0
SOA                 0
E                   0
DP                  0
FP                  0
name                0
park                0
attendance          0
BPF                 0
PPF                 0
teamIDBR            0
teamIDlahman45      0
teamIDretro         0
dtype: int64

After finding what missing values were in the data set, there are only missing values on the WCWin. At that time, if the team win the DivWin, they make a playoff, so I have changed the missing values on the WCWin columns to N.

In [7]:
# The wildcard system had been entered in since 2012
df["WCWin"].fillna("N", inplace = True)
df["WCWin"].head(120)
Out[7]:
0      N
1      N
2      N
3      N
4      N
5      N
6      N
7      N
8      N
9      N
10     N
11     N
12     N
13     N
14     N
15     N
16     N
17     N
18     N
19     N
20     N
21     N
22     N
23     N
24     N
25     N
26     N
27     N
28     N
29     N
30     N
31     N
32     N
33     N
34     N
35     N
36     N
37     N
38     N
39     N
40     N
41     N
42     N
43     N
44     N
45     N
46     N
47     N
48     N
49     N
50     N
51     N
52     N
53     N
54     N
55     N
56     N
57     N
58     N
59     N
60     N
61     N
62     N
63     N
64     N
65     N
66     N
67     N
68     N
69     N
70     N
71     N
72     N
73     N
74     N
75     N
76     N
77     N
78     N
79     N
80     N
81     N
82     N
83     N
84     N
85     N
86     N
87     N
88     N
89     N
90     N
91     N
92     N
93     N
94     N
95     N
96     N
97     N
98     N
99     N
100    N
101    N
102    N
103    N
104    N
105    N
106    N
107    N
108    N
109    N
110    N
111    N
112    N
113    N
114    Y
115    N
116    N
117    N
118    N
119    N
Name: WCWin, dtype: object

As teams have changed their franchID, I have checked the teams that were on the same location but different name, so change those teams into LA Angels, Chicago White Sox, Miami Marlins and Tempa Bay Rays

In [8]:
# Check the duplicated values and erase one row, going to use franchID and change the name (refers to MLB Team abbreviations)
print(df['franchID'].unique())

df['franchID'] = df['franchID'].replace({'ANA' : 'LAA', 'CHW' : 'CWS', 'FLA' : 'MIA', 'TBD' : 'TBR'})
# To check that the franchID has been replaced, then franchID only have 30 teams.
print(df['franchID'].unique())
['ATL' 'BAL' 'BOS' 'ANA' 'CHW' 'CHC' 'CIN' 'CLE' 'DET' 'HOU' 'KCR' 'LAD'
 'MIN' 'MIL' 'WSN' 'NYY' 'NYM' 'OAK' 'PHI' 'PIT' 'SDP' 'SEA' 'SFG' 'STL'
 'TEX' 'TOR' 'COL' 'FLA' 'ARI' 'TBD']
['ATL' 'BAL' 'BOS' 'LAA' 'CWS' 'CHC' 'CIN' 'CLE' 'DET' 'HOU' 'KCR' 'LAD'
 'MIN' 'MIL' 'WSN' 'NYY' 'NYM' 'OAK' 'PHI' 'PIT' 'SDP' 'SEA' 'SFG' 'STL'
 'TEX' 'TOR' 'COL' 'MIA' 'ARI' 'TBR']

Column descriptions:¶

[Hitting]¶

1B: singles BA: Batting Average (The ratio of hits to at-bats) OBP: On-Base Percentage (The percentage of plate appearances resulting in the batter reaching bases) SLG: Slugging Percentage (A measure of the team's power-hitting ability, calculated as total bases divided by at-bats) TB: Total Bases (The sum of bases earned through singles, doubles, triples, and home runs) OPS: On-Base Plus Slugging (The sum of OBP and SLG) GPA: Gross Production Average (A measure of a player's overall offensive production, combining OBP and SLG) TA: Total Average (A metric considering total bases, walks, hit by pitch, stolen bases, and other factors per plate appearance) PSN: Power-Speed Number (A combined measure of player's home run and stolen base proficiency) ISO: Isolated Power (A measure of a team's raw power, calculated as SLG minus BA) BABIP: Batting Average on Balls in Play (The ratio of walks and hit by pitch to toal at-bats)

[Pitching]¶

WHIP: Walks plus Hits per Inning Pitched (A measure of a pitcher's effectiveness in preventing baserunners) BAA: Batting Average Against (The opposing batters' batting average against the team's pitchers) K/BB: Strikeouts per Walk (The ratio of strikeouts to walks, indicating a pitcher's control and dominance) BB/HBP_ratio: Walks and Hit by Pitch Ratio (The ratio of walks and hit by pitch to total at-bats) IP: Inning Pitched (The total number of innings pitched by the team's pitchers, converted from outs)

[Fielding]¶

FP : Fielding Percentage

[Team]¶

P%: Power Percentage (The percentage of runs squared divided by the sum of runs squared and runs allowed squared, providing a measure of a team's power) WP: The ratio of wins to total games played, indicating the team's winning rate.

In [9]:
#Add WP, BA, 1B, OBP, SLG, OPS, IP(IPouts/3), WHIP , TB, GPA, TA, PSN, ISO, BABIP, P%
#Add BAA(Batting Average Against), (BB/HBP_ratio = walk and hit by pitch ratio) , K/BB
df["WP"] = round(df["W"]/df["G"],3)
df["P%"] = round((df["R"]**2)/(df["R"]**2+df["RA"]**2),2)
df["BA"] = round(df["H"]/df["AB"],3)
df["1B"] = df["H"] - df["HR"] - df["3B"] - df["2B"]
df["OBP"] = round((df["H"] + df["BB"] + df["HBP"] + df["SF"])/(df["AB"] + df["BB"] + df["HBP"] + df["SF"]),3)
df["SLG"] = round((df["1B"] + 2*df["2B"] + 3*df["3B"] + 4*df["HR"])/df["AB"],3)
df["TB"] = df["1B"] + 2*df["2B"] +3*df["3B"] + 4*df["HR"]
df["OPS"] = round(df["OBP"] + df["SLG"],3)
df["GPA"] = round((1.8*df["OBP"]+df["SLG"])/4,3)
df["TA"] = round((df["TB"]+df["HBP"]+df["BB"]+df["SB"])/(df["AB"]-df["H"]+df["CS"]+df["DP"]),3)
df["PSN"] = round((df["HR"]*df["SB"]*2)/(df["HR"]+df["SB"]),3)
df["IP"] = round(df["IPouts"]/3,2)
df["WHIP"] = round((df["HA"] + df["BBA"])/(df["IP"]),3)
df["BAA"] = round(df["HA"]/(df["HA"]+df["IP"]),3)
df["K/BB"] = round(df["SO"]/df["BB"],3)
df["ISO"] = df["SLG"]-df["BA"]
df["BB/HBP_ratio"] = round((df["BB"] + df["HBP"])/df["AB"],3)
df["BABIP"] = round((df["H"]-df["HR"])/(df["AB"]-df["SO"]-df["HR"]+df["SF"]),3)
df.head(10)
Out[9]:
yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin WSWin R AB H 2B 3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV IPouts HA HRA BBA SOA E DP FP name park attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro WP P% BA 1B OBP SLG TB OPS GPA TA PSN IP WHIP BAA K/BB ISO BB/HBP_ratio BABIP
0 1990 NL ATL ATL W 6 162 81.0 65 97 N N N N 682 5504 1376 263 26 162 473.0 1010.0 92.0 55.0 27.0 31.0 821 727 4.58 17 8 30 4289 1527 128 579 938 158 133 0.974 Atlanta Braves Atlanta-Fulton County Stadium 980129.0 105 106 ATL ATL ATL 0.401 0.41 0.250 925 0.316 0.396 2177 0.712 0.241 0.642 117.354 1429.67 1.473 0.516 2.135 0.146 0.091 0.278
1 1990 AL BAL BAL E 5 161 80.0 76 85 N N N N 669 5410 1328 234 22 132 660.0 962.0 94.0 52.0 40.0 41.0 698 644 4.04 10 5 43 4306 1445 161 537 776 93 151 0.985 Baltimore Orioles Memorial Stadium 2415189.0 97 98 BAL BAL BAL 0.472 0.48 0.245 940 0.336 0.370 2002 0.706 0.244 0.653 109.805 1435.33 1.381 0.502 1.458 0.125 0.129 0.275
2 1990 AL BOS BOS E 1 162 81.0 88 74 Y N N N 699 5516 1502 298 31 106 598.0 795.0 53.0 52.0 28.0 44.0 664 596 3.72 15 13 44 4326 1439 92 519 997 123 154 0.980 Boston Red Sox Fenway Park II 2528986.0 105 105 BOS BOS BOS 0.543 0.53 0.272 1067 0.351 0.395 2180 0.746 0.257 0.677 70.667 1442.00 1.358 0.499 1.329 0.123 0.113 0.300
3 1990 AL CAL LAA W 4 162 81.0 80 82 N N N N 690 5570 1448 237 27 147 566.0 1000.0 69.0 43.0 28.0 45.0 706 613 3.79 21 13 42 4362 1482 106 544 944 142 186 0.978 California Angels Anaheim Stadium 2555688.0 97 97 CAL CAL CAL 0.494 0.49 0.260 1037 0.336 0.391 2180 0.727 0.249 0.653 93.917 1454.00 1.393 0.505 1.767 0.131 0.107 0.291
4 1990 AL CHA CWS W 2 162 80.0 94 68 N N N N 682 5402 1393 251 44 106 478.0 903.0 140.0 90.0 36.0 47.0 633 581 3.61 17 10 68 4348 1313 106 548 914 124 169 0.980 Chicago White Sox Comiskey Park 2002357.0 98 98 CHW CHA CHA 0.580 0.54 0.258 992 0.328 0.379 2050 0.707 0.242 0.634 120.650 1449.33 1.284 0.475 1.889 0.121 0.095 0.290
5 1990 NL CHN CHC E 4 162 81.0 77 85 N N N N 690 5600 1474 240 36 136 406.0 869.0 151.0 50.0 30.0 51.0 774 695 4.34 13 7 42 4328 1510 121 572 877 124 136 0.980 Chicago Cubs Wrigley Field 2243791.0 108 108 CHC CHN CHN 0.475 0.44 0.263 1062 0.322 0.392 2194 0.714 0.243 0.645 143.108 1442.67 1.443 0.511 2.140 0.129 0.078 0.288
6 1990 NL CIN CIN W 1 162 81.0 91 71 Y N Y Y 693 5525 1466 284 40 125 466.0 913.0 166.0 66.0 42.0 42.0 597 549 3.39 14 12 50 4369 1338 124 543 1029 102 126 0.983 Cincinnati Reds Riverfront Stadium 2400892.0 105 105 CIN CIN CIN 0.562 0.57 0.265 1017 0.332 0.399 2205 0.731 0.249 0.677 142.612 1456.33 1.292 0.479 1.959 0.134 0.092 0.296
7 1990 AL CLE CLE E 4 162 81.0 77 85 N N N N 732 5485 1465 266 41 110 458.0 836.0 107.0 52.0 29.0 61.0 737 676 4.26 12 10 47 4282 1491 163 518 860 117 146 0.981 Cleveland Indians Cleveland Stadium 1225240.0 100 100 CLE CLE CLE 0.475 0.50 0.267 1048 0.334 0.391 2143 0.725 0.248 0.649 108.479 1427.33 1.408 0.511 1.825 0.124 0.089 0.295
8 1990 AL DET DET E 3 162 81.0 79 83 N N N N 750 5479 1418 241 32 172 634.0 952.0 82.0 57.0 34.0 41.0 754 697 4.39 15 12 45 4291 1401 154 661 856 131 178 0.979 Detroit Tigers Tiger Stadium 1495785.0 101 102 DET DET DET 0.488 0.50 0.259 973 0.344 0.409 2239 0.753 0.257 0.696 111.055 1430.33 1.442 0.495 1.502 0.150 0.122 0.283
9 1990 NL HOU HOU W 4 162 81.0 75 87 N N N N 573 5379 1301 209 32 94 548.0 997.0 179.0 83.0 28.0 41.0 656 581 3.61 12 6 37 4350 1396 130 496 854 131 124 0.978 Houston Astros Astrodome 1310927.0 97 98 HOU HOU HOU 0.463 0.43 0.242 966 0.320 0.345 1856 0.665 0.230 0.609 123.267 1450.00 1.305 0.491 1.819 0.103 0.107 0.279
In [10]:
#Check which teams are qualifed for playoff
df["make_playoffs_rank_1"] = df["Rank"]==1
df["make_playoffs_wild_card"] = df["WCWin"]=="Y"
df["make_playoffs_win_division"] = df["DivWin"]=="Y"
df["make_playoffs"] = df["make_playoffs_rank_1"] | df["make_playoffs_wild_card"] | df["make_playoffs_win_division"]
df = pd.get_dummies(df, columns = ["make_playoffs"], drop_first = True)
df.head(10)
Out[10]:
yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin WSWin R AB H 2B 3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV IPouts HA HRA BBA SOA E DP FP name park attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro WP P% BA 1B OBP SLG TB OPS GPA TA PSN IP WHIP BAA K/BB ISO BB/HBP_ratio BABIP make_playoffs_rank_1 make_playoffs_wild_card make_playoffs_win_division make_playoffs_True
0 1990 NL ATL ATL W 6 162 81.0 65 97 N N N N 682 5504 1376 263 26 162 473.0 1010.0 92.0 55.0 27.0 31.0 821 727 4.58 17 8 30 4289 1527 128 579 938 158 133 0.974 Atlanta Braves Atlanta-Fulton County Stadium 980129.0 105 106 ATL ATL ATL 0.401 0.41 0.250 925 0.316 0.396 2177 0.712 0.241 0.642 117.354 1429.67 1.473 0.516 2.135 0.146 0.091 0.278 False False False 0
1 1990 AL BAL BAL E 5 161 80.0 76 85 N N N N 669 5410 1328 234 22 132 660.0 962.0 94.0 52.0 40.0 41.0 698 644 4.04 10 5 43 4306 1445 161 537 776 93 151 0.985 Baltimore Orioles Memorial Stadium 2415189.0 97 98 BAL BAL BAL 0.472 0.48 0.245 940 0.336 0.370 2002 0.706 0.244 0.653 109.805 1435.33 1.381 0.502 1.458 0.125 0.129 0.275 False False False 0
2 1990 AL BOS BOS E 1 162 81.0 88 74 Y N N N 699 5516 1502 298 31 106 598.0 795.0 53.0 52.0 28.0 44.0 664 596 3.72 15 13 44 4326 1439 92 519 997 123 154 0.980 Boston Red Sox Fenway Park II 2528986.0 105 105 BOS BOS BOS 0.543 0.53 0.272 1067 0.351 0.395 2180 0.746 0.257 0.677 70.667 1442.00 1.358 0.499 1.329 0.123 0.113 0.300 True False True 1
3 1990 AL CAL LAA W 4 162 81.0 80 82 N N N N 690 5570 1448 237 27 147 566.0 1000.0 69.0 43.0 28.0 45.0 706 613 3.79 21 13 42 4362 1482 106 544 944 142 186 0.978 California Angels Anaheim Stadium 2555688.0 97 97 CAL CAL CAL 0.494 0.49 0.260 1037 0.336 0.391 2180 0.727 0.249 0.653 93.917 1454.00 1.393 0.505 1.767 0.131 0.107 0.291 False False False 0
4 1990 AL CHA CWS W 2 162 80.0 94 68 N N N N 682 5402 1393 251 44 106 478.0 903.0 140.0 90.0 36.0 47.0 633 581 3.61 17 10 68 4348 1313 106 548 914 124 169 0.980 Chicago White Sox Comiskey Park 2002357.0 98 98 CHW CHA CHA 0.580 0.54 0.258 992 0.328 0.379 2050 0.707 0.242 0.634 120.650 1449.33 1.284 0.475 1.889 0.121 0.095 0.290 False False False 0
5 1990 NL CHN CHC E 4 162 81.0 77 85 N N N N 690 5600 1474 240 36 136 406.0 869.0 151.0 50.0 30.0 51.0 774 695 4.34 13 7 42 4328 1510 121 572 877 124 136 0.980 Chicago Cubs Wrigley Field 2243791.0 108 108 CHC CHN CHN 0.475 0.44 0.263 1062 0.322 0.392 2194 0.714 0.243 0.645 143.108 1442.67 1.443 0.511 2.140 0.129 0.078 0.288 False False False 0
6 1990 NL CIN CIN W 1 162 81.0 91 71 Y N Y Y 693 5525 1466 284 40 125 466.0 913.0 166.0 66.0 42.0 42.0 597 549 3.39 14 12 50 4369 1338 124 543 1029 102 126 0.983 Cincinnati Reds Riverfront Stadium 2400892.0 105 105 CIN CIN CIN 0.562 0.57 0.265 1017 0.332 0.399 2205 0.731 0.249 0.677 142.612 1456.33 1.292 0.479 1.959 0.134 0.092 0.296 True False True 1
7 1990 AL CLE CLE E 4 162 81.0 77 85 N N N N 732 5485 1465 266 41 110 458.0 836.0 107.0 52.0 29.0 61.0 737 676 4.26 12 10 47 4282 1491 163 518 860 117 146 0.981 Cleveland Indians Cleveland Stadium 1225240.0 100 100 CLE CLE CLE 0.475 0.50 0.267 1048 0.334 0.391 2143 0.725 0.248 0.649 108.479 1427.33 1.408 0.511 1.825 0.124 0.089 0.295 False False False 0
8 1990 AL DET DET E 3 162 81.0 79 83 N N N N 750 5479 1418 241 32 172 634.0 952.0 82.0 57.0 34.0 41.0 754 697 4.39 15 12 45 4291 1401 154 661 856 131 178 0.979 Detroit Tigers Tiger Stadium 1495785.0 101 102 DET DET DET 0.488 0.50 0.259 973 0.344 0.409 2239 0.753 0.257 0.696 111.055 1430.33 1.442 0.495 1.502 0.150 0.122 0.283 False False False 0
9 1990 NL HOU HOU W 4 162 81.0 75 87 N N N N 573 5379 1301 209 32 94 548.0 997.0 179.0 83.0 28.0 41.0 656 581 3.61 12 6 37 4350 1396 130 496 854 131 124 0.978 Houston Astros Astrodome 1310927.0 97 98 HOU HOU HOU 0.463 0.43 0.242 966 0.320 0.345 1856 0.665 0.230 0.609 123.267 1450.00 1.305 0.491 1.819 0.103 0.107 0.279 False False False 0
In [11]:
df.groupby("yearID")["make_playoffs_True"].value_counts()
Out[11]:
yearID  make_playoffs_True
1990    0                     22
        1                      4
1991    0                     22
        1                      4
1992    0                     22
        1                      4
1993    0                     24
        1                      4
1995    0                     20
        1                      8
1996    0                     20
        1                      8
1997    0                     20
        1                      8
1998    0                     22
        1                      8
1999    0                     22
        1                      8
2000    0                     22
        1                      8
2001    0                     22
        1                      8
2002    0                     22
        1                      8
2003    0                     22
        1                      8
2004    0                     22
        1                      8
2005    0                     22
        1                      8
2006    0                     22
        1                      8
2007    0                     22
        1                      8
2008    0                     22
        1                      8
2009    0                     22
        1                      8
2010    0                     22
        1                      8
2011    0                     22
        1                      8
2012    0                     20
        1                     10
2013    0                     20
        1                     10
2014    0                     20
        1                     10
2015    0                     20
        1                     10
2016    0                     20
        1                     10
2017    0                     20
        1                     10
2018    0                     20
        1                     10
2019    0                     20
        1                     10
2021    0                     20
        1                     10
Name: make_playoffs_True, dtype: int64

EDA¶

Correlation Map¶

In the visualization part, first I created a correlation matrix/heatmap of all the numeric variables in the data frame. It easily shows the correlation between the variables. If the correlation is high between each others, one may need to be removed during the building the model to avoid multicollinearity problem. As new variables are built using the exist columns, such as OPS, OBP, WHIP are made by the variables that already exist such as how many hits(homeruns) they made or ERA. There should be some high correlations then I should figure out which variables that I would like to use mostly in this model building step.

In [12]:
#make a Correlation map of all the columns in the data set
plt.subplots(figsize=(20,20))
plt.title("Correlation Matrix of Baseball data")
sns.heatmap(df.corr(), vmin=-1, vmax=1, annot=True, annot_kws = {'size':5}, cmap = "YlGnBu")
Out[12]:
<Axes: title={'center': 'Correlation Matrix of Baseball data'}>

Divide it into 2 correlation map to improve "READABILITY"¶

As it has a lot of columns, I would like to divide into 2 seperate correlations to make heatmap readability.

In [13]:
df.shape[1]
Out[13]:
70
In [14]:
df1 = df.iloc[:, :37]
df2 = df.iloc[:, 37:]
In [15]:
plt.figure(figsize=(10, 10))
sns.heatmap(df1.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap - Part 1')
plt.show()

plt.figure(figsize=(10, 10))
sns.heatmap(df2.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap - Part 2')
plt.show()

Visualization¶

1990 - 2021¶

Figure below shows from 1990 to 2021, I would like to figure out which teams qualified playoff the most and the least. New York Yankees appeared the most with more than 20 times and Atlanta Braves appeard the second most time with about 20 times. Miami Marlins and Kansas City Royals appeared the least with only 2 times.

In [16]:
#Make a barplot to check who qualifed the playoff in 1990-2021
playoff_qual = df[df["make_playoffs_True"]==1]["franchID"]
plt.subplots(figsize=(15,5))
plt.hist(playoff_qual, bins=100)
plt.title("Playoff Qualifiers franchises (1990-2021)")
Out[16]:
Text(0.5, 1.0, 'Playoff Qualifiers franchises (1990-2021)')

Comparions of Statistics with Playoff Qualifiers and Non Qualifiers¶

The figures below show comparisons of most of the variables that I made during the data preprocessing step between teams that qualified for the playoffs and teams that could not qualify for the playoffs.

Winning Percentage which shows the total team's result for season 1990-2021, as we can expect easily that teams that qualified playoff have higher winning percentage, even the team has the lowest winning percentage still have better percentage than the average percentage of the teams that could not qualify the playoff. Power Percentage is the stats that from runs and runs allowed. It can show us the approximate relation between runs and runs allowed. Additionally, it demonstrates the significant difference between the teams that made the playoffs and those that couldn't make the playoffs. The better teams have a higher chance to score more and allow fewer runs, and this connects to winning the game.

For Batting Average, there are not much different between the teams that make playoff and not making playoff. The difference is approximately 0.007, and the highest batting average team that could not make playoff is almost same or higher than the highest batting average team who make playoff. Then compare the Hitters Statistics which are OBP, SLG, OPS, GPA, TA, BABIP, PSN, ISO, and BB/HBP_ratio. Most of the statistics that the team who made a playoff are slightly higher than the team who couldn't make a playoff, but it doesn't show me the bigger gap or dramatic difference between those teams than I expected. But when we look at TA and GPA, which are Total Average and Gross Production Average, they cover most of the hitting statistics and show the bigger difference between the teams that make the playoffs and the teams that couldn't make the playoffs. I could possibly say that each individual stat doesn't give me the dramatic gap between the two categories, but when we use advanced statistics, even small differences can make a bigger gap.

There are also Pittching Statistics which are K/BB, BAA, ERA, and WHIP. An interesting result was shown in K/BB. Usually, a better team has a higher K/BB ratio, but in our plot, it shows that the teams that couldn't make the playoffs have a higher K/BB ratio. Furthermore, the range of K/BB for the non-qualifying teams is higher than that of the qualifying teams. Except for K/BB, the other statistics—BAA, ERA, and WHIP—show typical plots. Less ERA, BAA and WHIP show the better pitcher teams in Baseball.

Fielding Percentage shows how well the player performs when they are on the pitch inning. It measures the number of successfully fielded balls (putouts and assists) divided by the number of opportunities (putouts, assists, and errors). It doesn't show a significant difference between the teams that make the playoffs and the teams that couldn't make the playoffs. However, the lower quartile of the teams that couldn't make the playoffs is much lower than that of the teams that make the playoffs.

To summarize my findings from my visualization, most of the teams' batting statistics are not much different between the teams that make the playoffs and the teams that couldn't make the playoffs. However, there are differences in the pitching statistics. Therefore, I could say that to be a better team, the pitchers are more important than the hitters to make a playoff.

In [17]:
import matplotlib.pyplot as plt
import seaborn as sns

# List of statistics to analyze
statistics = ['WP', 'P%', 'BA', 'OBP', 'SLG', 'OPS', 'GPA', 'TA', 'BABIP', 'PSN', 'ISO', 'BB/HBP_ratio', 
              'K/BB', 'BAA', 'ERA', 'WHIP', 
              'FP']

# Iterate through each statistic
for stat in statistics:
    # Calculate the average for playoff teams
    average_playoff_teams = df[df["make_playoffs_True"] == 1][stat].mean()

    # Calculate the average for non-playoff teams
    average_non_playoff_teams = df[df["make_playoffs_True"] == 0][stat].mean()

    # Print the results
    print(f"Average {stat} for playoff teams: {average_playoff_teams}")
    print(f"Average {stat} for non-playoff teams: {average_non_playoff_teams}")

    # Create a boxplot for the current statistic
    plt.figure(figsize=(8, 6))
    sns.boxplot(x="make_playoffs_True", y=stat, data=df)
    plt.xticks([0, 1], ["No Playoff", "Playoff"])
    plt.xlabel("Playoff Qualification")
    plt.ylabel(stat)
    plt.title(f"Comparison of {stat} between Playoff and Non-Playoff Teams")
    plt.show()
Average WP for playoff teams: 0.583107438016529
Average WP for non-playoff teams: 0.4683495297805642
Average P% for playoff teams: 0.5793388429752065
Average P% for non-playoff teams: 0.4706112852664577
Average BA for playoff teams: 0.26522727272727276
Average BA for non-playoff teams: 0.25834012539184953
Average OBP for playoff teams: 0.3448925619834711
Average OBP for non-playoff teams: 0.3322836990595612
Average SLG for playoff teams: 0.4299504132231405
Average SLG for non-playoff teams: 0.40727115987460816
Average OPS for playoff teams: 0.7748429752066116
Average OPS for non-playoff teams: 0.7395548589341693
Average GPA for playoff teams: 0.2626776859504132
Average GPA for non-playoff teams: 0.25135736677115983
Average TA for playoff teams: 0.731396694214876
Average TA for non-playoff teams: 0.6808275862068967
Average BABIP for playoff teams: 0.3000413223140495
Average BABIP for non-playoff teams: 0.29550313479623824
Average PSN for playoff teams: 122.8812479338843
Average PSN for non-playoff teams: 116.20618181818183
Average ISO for playoff teams: 0.16472314049586775
Average ISO for non-playoff teams: 0.14893103448275863
Average BB/HBP_ratio for playoff teams: 0.11323553719008266
Average BB/HBP_ratio for non-playoff teams: 0.10300156739811912
Average K/BB for playoff teams: 2.011099173553719
Average K/BB for non-playoff teams: 2.215307210031348
Average BAA for playoff teams: 0.4877148760330579
Average BAA for non-playoff teams: 0.5029639498432601
Average ERA for playoff teams: 3.8762396694214876
Average ERA for non-playoff teams: 4.39628526645768
Average WHIP for playoff teams: 1.2971983471074382
Average WHIP for non-playoff teams: 1.3913150470219435
Average FP for playoff teams: 0.9838595041322314
Average FP for non-playoff teams: 0.9822727272727274

Team Bating statistics¶

In [18]:
#Make the plot to see OPS, GPA, TA, ISO, BABIP, and P% per Each Team
plt.figure(figsize=(12,5))

# Calculate the average of each metric for each team and year
team_batting_statistics = df.groupby(['franchID'])[['OPS','GPA','TA','ISO','BABIP','P%']].mean()

# Separate playoff qualifiers and non-qualifiers for each year
playoff_qualifiers = df[df['make_playoffs_True'] == 1].groupby(['franchID']).mean()
non_qualifiers = df[df['make_playoffs_True'] == 0].groupby(['franchID']).mean()

# Make the chart look pretty
plt.figure(figsize=(20, 10))

# Melt the DataFrame to reformat it for barplot
team_statistic_melted1 = team_batting_statistics.reset_index().melt(id_vars='franchID', var_name='Statistic', value_name='Value')

# Draw the barplot
ax = sns.barplot(data=team_statistic_melted1, x='franchID', y='Value', hue='Statistic', palette="pastel")
plt.xticks(rotation=50)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
plt.title("Average OPS, GPA, TA, ISO, BABIP, and P% per Team")
plt.xlabel("Team")
plt.ylabel("Statistics")
plt.show()
<Figure size 1200x500 with 0 Axes>

Team Pitching statistics¶

In [19]:
#Make the plot to see WHIP, WP, BAA, K/BB, BB/HBP_ratio per Each Team
plt.figure(figsize=(12,5))

# Calculate the average of each metric for each team and year
team_pitching_statistics = df.groupby(['franchID'])[['WHIP','BAA','K/BB','BB/HBP_ratio']].mean()

# Separate playoff qualifiers and non-qualifiers for each year
playoff_qualifiers = df[df['make_playoffs_True'] == 1].groupby(['franchID']).mean()
non_qualifiers = df[df['make_playoffs_True'] == 0].groupby(['franchID']).mean()

# Make the chart look pretty
plt.figure(figsize=(20, 10))

# Melt the DataFrame to reformat it for barplot
team_statistic_melted2 = team_pitching_statistics.reset_index().melt(id_vars='franchID', var_name='Statistic', value_name='Value')

# Draw the barplot
ax = sns.barplot(data=team_statistic_melted2, x='franchID', y='Value', hue='Statistic', palette="pastel")
plt.xticks(rotation=50)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
plt.title("Average WHIP, BAA, K/BB, BB/HBP_ratio per Team")
plt.xlabel("Team")
plt.ylabel("Statistics")
plt.show()
<Figure size 1200x500 with 0 Axes>

Data Pre_Processing For Model Built Out¶

Before I build a model, from the heatmap that I plotted above, I would like to remove the columns that are highly correlated or the columns that are categorized or the columns that seem not meaningful for the model. After remove the columns I found that my new data frame have 880 observations with 13 columns that total data is 11440. It still can be use train-test split without resampling method(K-fold cross validation), but as the observation is not as much as I expected, therefore, I would like to use 10 folds cross validation to make data frame more meaningful. The variables that I created from the preprocessing steps, which are OBP, SLG, and OPS, are all related to GPA; therefore, I removed those variables. After that, I seperated out my predictor variable which is make_playoffs_True.

In [20]:
#Removing unnecessary columns to do modeling.
columns_to_drop = ["yearID","G","lgID","franchID","divID","Ghome","W","L","AB","H","1B","2B","3B","BB",
                   "SB","CG","SHO","SV","IPouts","R","RA","ER","teamID","name","park",
                   "attendance","DivWin","WCWin","LgWin","WSWin","teamIDBR","teamIDlahman45","teamIDretro",
                   "make_playoffs_rank_1","make_playoffs_wild_card","make_playoffs_win_division","PPF","BPF",
                   "HA","HRA","BBA","SO","SOA","E","CS","SF","DP","BA","Rank","HR","HBP","TB","IP", "WP", "OBP", "SLG", "OPS"]
ndf = df.drop(columns =columns_to_drop, inplace = False)
ndf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 880 entries, 0 to 879
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ERA                 880 non-null    float64
 1   FP                  880 non-null    float64
 2   P%                  880 non-null    float64
 3   GPA                 880 non-null    float64
 4   TA                  880 non-null    float64
 5   PSN                 880 non-null    float64
 6   WHIP                880 non-null    float64
 7   BAA                 880 non-null    float64
 8   K/BB                880 non-null    float64
 9   ISO                 880 non-null    float64
 10  BB/HBP_ratio        880 non-null    float64
 11  BABIP               880 non-null    float64
 12  make_playoffs_True  880 non-null    uint8  
dtypes: float64(12), uint8(1)
memory usage: 83.5 KB
In [21]:
#make a heatmap of the important columns to predict the playoff in the data set // 21 columns - 7 
plt.subplots(figsize=(10,10))
plt.title("Correlation Matrix of Baseball data")
sns.heatmap(ndf.corr(), vmin=-1, vmax=1, annot=True, annot_kws = {'size':15}, cmap = "YlGnBu")
Out[21]:
<Axes: title={'center': 'Correlation Matrix of Baseball data'}>
In [22]:
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Separate features (X) and target variable (y)
features = ['BB/HBP_ratio', 'K/BB', 'BAA', 'BABIP', 'PSN', 'TA', 'GPA', 'ERA', 'WHIP', 'FP', "ISO",  "P%"]
X = ndf[features]
y = ndf["make_playoffs_True"]

# Initialize k-fold cross-validator
n_splits = 10

kf = KFold(n_splits=n_splits, shuffle = True, random_state=42)
# Split data into training and testing sets with a test size of 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Model Build-Out¶

a. Logistic Regression¶

b. Random Forest Classifier¶

c. XGBoost¶

d. Support Vector Machine¶

e. K-Nearest Neighbors, KNN¶

a. Logistic Regression¶

In [23]:
from sklearn.linear_model import LogisticRegression

# Initialize lists to store evaluation metrics
accuracy_scores = []
classification_reports = []
confusion_matrices = []

# Perform k-fold cross-validation
for train_index, test_index in kf.split(X_train):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Create and train the logistic regression model
    lr = LogisticRegression()
    lr.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred_lr = lr.predict(X_test)

    # Evaluate the model's performance
    accuracy_lr = accuracy_score(y_test, y_pred_lr)
    classification_rep_lr = classification_report(y_test, y_pred_lr)
    conf_matrix = confusion_matrix(y_test, y_pred_lr)
    
    # Append evaluation metrics to respective lists
    accuracy_scores.append(accuracy_lr)
    classification_reports.append(classification_rep_lr)
    confusion_matrices.append(conf_matrix)

# Calculate mean accuracy score
mean_accuracy = np.mean(accuracy_scores)

# Print mean accuracy score
print("Mean Accuracy:", mean_accuracy)
print("Accuracy:", accuracy_lr)
print("Classification Report:")
print(classification_rep_lr)
print("Confusion Matrix:")
print(conf_matrix)
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Mean Accuracy: 0.8139034205231388
Accuracy: 0.8571428571428571
Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.92      0.91        53
           1       0.73      0.65      0.69        17

    accuracy                           0.86        70
   macro avg       0.81      0.79      0.80        70
weighted avg       0.85      0.86      0.85        70

Confusion Matrix:
[[49  4]
 [ 6 11]]
In [24]:
# Figure out meaningful features.
from sklearn.feature_selection import RFE
lrrfe = LogisticRegression()

rfe = RFE(lrrfe, n_features_to_select=10)
fit = rfe.fit(X_train, y_train)

selected_features = fit.support_
print("Selected Features:", selected_features)
Selected Features: [ True  True  True  True False  True  True  True  True False  True  True]

b. Random Forest Classifier¶

In [25]:
from sklearn.ensemble import RandomForestClassifier

# Initialize lists to store evaluation metrics
accuracy_scores = []
classification_reports = []
confusion_matrices = []

# Split data into training and testing sets with a test size of 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Perform k-fold cross-validation
for train_index, test_index in kf.split(X_train):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Create and train the Random Forest classifier model
    rfc = RandomForestClassifier()
    rfc.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred_rfc = rfc.predict(X_test)

    # Evaluate the model's performance
    accuracy_rfc = accuracy_score(y_test, y_pred_rfc)
    classification_rep_rfc = classification_report(y_test, y_pred_rfc)
    conf_matrix = confusion_matrix(y_test, y_pred_rfc)
    
    # Append evaluation metrics to respective lists
    accuracy_scores.append(accuracy_rfc)
    classification_reports.append(classification_rep_rfc)
    confusion_matrices.append(conf_matrix)

# Calculate mean accuracy score
mean_accuracy = np.mean(accuracy_scores)

# Print mean accuracy score
print("Mean Accuracy:", mean_accuracy)
print("Importance:", rfc.feature_importances_)
print("Accuracy:", accuracy_rfc)
print("Classification Report:")
print(classification_rep_rfc)
print("Confusion Matrix:")
print(conf_matrix)
Mean Accuracy: 0.8749698189134809
Importance: [0.04177033 0.04881799 0.0558487  0.04426561 0.04872195 0.10261489
 0.07658676 0.07670012 0.10828458 0.02473012 0.06631842 0.30534053]
Accuracy: 0.9
Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.92      0.93        53
           1       0.78      0.82      0.80        17

    accuracy                           0.90        70
   macro avg       0.86      0.87      0.87        70
weighted avg       0.90      0.90      0.90        70

Confusion Matrix:
[[49  4]
 [ 3 14]]

c. XGBoost¶

In [26]:
from xgboost import XGBClassifier

# Initialize lists to store evaluation metrics
accuracy_scores = []
classification_reports = []
confusion_matrices = []

# Split data into training and testing sets with a test size of 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Perform k-fold cross-validation
for train_index, test_index in kf.split(X_train):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Create and train the XGBoost classifier model
    xgb = XGBClassifier()
    xgb.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred_xgb = xgb.predict(X_test)

    # Evaluate the model's performance
    accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
    classification_rep = classification_report(y_test, y_pred_xgb)
    classification_rep_xgb = confusion_matrix(y_test, y_pred_xgb)
    
    # Append evaluation metrics to respective lists
    accuracy_scores.append(accuracy_xgb)
    classification_reports.append(classification_rep)
    confusion_matrices.append(classification_rep_xgb)

# Calculate mean accuracy score
mean_accuracy = np.mean(accuracy_scores)

# Print mean accuracy score
print("Mean Accuracy:", mean_accuracy)
print("Accuracy:", accuracy_xgb)
print("Classification Report:")
print(classification_rep_xgb)
print("Confusion Matrix:")
print(conf_matrix)
Mean Accuracy: 0.8636016096579476
Accuracy: 0.8714285714285714
Classification Report:
[[49  4]
 [ 5 12]]
Confusion Matrix:
[[49  4]
 [ 3 14]]

d. Support Vector Machine¶

In [27]:
from sklearn.svm import SVC

# Initialize lists to store evaluation metrics
accuracy_scores = []
classification_reports = []
confusion_matrices = []

# Split data into training and testing sets with a test size of 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Perform k-fold cross-validation
for train_index, test_index in kf.split(X_train):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Create the SVM classifier model
    svc = SVC()

    # Fit the model on the training data
    svc.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred_svc = svc.predict(X_test)

    # Evaluate the model's performance
    accuracy_svc = accuracy_score(y_test, y_pred_svc)
    classification_rep_svc = classification_report(y_test, y_pred_svc)
    conf_matrix = confusion_matrix(y_test, y_pred_svc)
    
    # Append evaluation metrics to respective lists
    accuracy_scores.append(accuracy_svc)
    classification_reports.append(classification_rep_svc)
    confusion_matrices.append(conf_matrix)

# Calculate mean accuracy score
mean_accuracy = np.mean(accuracy_scores)

# Print mean accuracy score
print("Mean Accuracy:", mean_accuracy)
print("Accuracy:", accuracy_svc)
print("Classification Report:")
print(classification_rep_svc)
print("Confusion Matrix:")
print(conf_matrix)
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
Mean Accuracy: 0.7414285714285713
Accuracy: 0.7571428571428571
Classification Report:
              precision    recall  f1-score   support

           0       0.76      1.00      0.86        53
           1       0.00      0.00      0.00        17

    accuracy                           0.76        70
   macro avg       0.38      0.50      0.43        70
weighted avg       0.57      0.76      0.65        70

Confusion Matrix:
[[53  0]
 [17  0]]
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\roymy\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

e. K-Nearest Neighbors, KNN¶

In [28]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Initialize lists to store evaluation metrics
accuracy_scores = []
classification_reports = []
confusion_matrices = []

# Perform train-test split with a test size of 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Perform k-fold cross-validation
for train_index, test_index in kf.split(X_train_scaled):
    X_train_fold, X_test_fold = X_train_scaled[train_index], X_train_scaled[test_index]
    y_train_fold, y_test_fold = y_train.iloc[train_index], y_train.iloc[test_index]
    
    # Create and train the k-nearest neighbors model
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train_fold, y_train_fold)

    # Make predictions on the test set
    y_pred_knn = knn.predict(X_test_scaled)

    # Evaluate the model's performance
    accuracy_knn = accuracy_score(y_test, y_pred_knn)
    classification_rep_knn = classification_report(y_test, y_pred_knn)
    conf_matrix = confusion_matrix(y_test, y_pred_knn)
    
    # Append evaluation metrics to respective lists
    accuracy_scores.append(accuracy_knn)
    classification_reports.append(classification_rep_knn)
    confusion_matrices.append(conf_matrix)

# Calculate mean accuracy score
mean_accuracy = np.mean(accuracy_scores)

# Print mean accuracy score
print("Mean Accuracy:", mean_accuracy)
print("Accuracy:", accuracy_knn)
print("Classification Report:")
print(classification_rep_knn)
print("Confusion Matrix:")
print(conf_matrix)
Mean Accuracy: 0.8835227272727273
Accuracy: 0.8806818181818182
Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.95      0.92       125
           1       0.86      0.71      0.77        51

    accuracy                           0.88       176
   macro avg       0.87      0.83      0.85       176
weighted avg       0.88      0.88      0.88       176

Confusion Matrix:
[[119   6]
 [ 15  36]]
In [29]:
# Grid Search for best hyperparameter k
param_grid = {'n_neighbors': range(1, 10)}
grid_search = GridSearchCV(knn, param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)

print("Best k:", grid_search.best_params_['n_neighbors'])
print("Best Accuracy:", grid_search.best_score_)
Best k: 5
Best Accuracy: 0.8522289766970618

Model Comaprison¶

In [30]:
models = {"LR": [accuracy_lr],
         "RFC":[accuracy_rfc],
         "XGB":[accuracy_xgb],
         "SVC":[accuracy_svc],
         "KNN": [accuracy_knn]}

# Create a DataFrame from the models dictionary
results_df = pd.DataFrame.from_dict(models, orient='index', columns=['Accuracy'])
results_df
Out[30]:
Accuracy
LR 0.857143
RFC 0.900000
XGB 0.871429
SVC 0.757143
KNN 0.880682

Predict playoff qualifiers by "Logistic Regression" for a given year (1990-2021), except for 1994, 2020 season.¶

In [31]:
import warnings
warnings.filterwarnings("ignore")

# predict playoff qualifiers for a given year (1990-2021), except for 1994, 2020 season.
for year in range(1990, 2022):
    if year != 1994 and year != 2020:
        _df = df[df["yearID"] == year].copy()
        X_year = _df[features]

        # Predict using the trained Logistic Regression model
        predicted_playoff_qualifiers_lr = lr.predict(X_year)

        # Add the predicted_playoff_qualifier column to the copied DataFrame
        _df["predicted_playoff_qualifier_lr"] = predicted_playoff_qualifiers_lr

        print("--------")
        print(year)
        print("--------")
        print("Actual playoff qualifiers in " + str(year) + ":")
        actual = set(_df[_df["make_playoffs_True"] == 1]["franchID"])
        print(actual)
        print("Predicted playoff qualifiers using Logistic Regression in " + str(year) + ":")
        predicted_lr = set(_df[_df["predicted_playoff_qualifier_lr"] == 1]["franchID"])
        print(predicted_lr)
        print()
        incorrect_lr = predicted_lr.difference(actual)
        print("Incorrect predictions using Logistic Regression (false positives) " + str(len(incorrect_lr)) + ":")
        print(incorrect_lr)
        exclusions_lr = actual.difference(predicted_lr)
        print("Incorrect exclusions from prediction using Logistic Regression (false negatives) " + str(len(exclusions_lr)) + ":")
        print(exclusions_lr)
        print()
        print()
--------
1990
--------
Actual playoff qualifiers in 1990:
{'PIT', 'BOS', 'OAK', 'CIN'}
Predicted playoff qualifiers using Logistic Regression in 1990:
{'PIT', 'WSN', 'SEA', 'TOR', 'OAK', 'NYM', 'CIN', 'LAD'}

Incorrect predictions using Logistic Regression (false positives) 5:
{'WSN', 'SEA', 'TOR', 'NYM', 'LAD'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 1:
{'BOS'}


--------
1991
--------
Actual playoff qualifiers in 1991:
{'PIT', 'TOR', 'MIN', 'ATL'}
Predicted playoff qualifiers using Logistic Regression in 1991:
{'PIT', 'MIN', 'ATL', 'TOR', 'CWS', 'NYM', 'LAD'}

Incorrect predictions using Logistic Regression (false positives) 3:
{'CWS', 'NYM', 'LAD'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 0:
set()


--------
1992
--------
Actual playoff qualifiers in 1992:
{'PIT', 'TOR', 'OAK', 'ATL'}
Predicted playoff qualifiers using Logistic Regression in 1992:
{'PIT', 'STL', 'WSN', 'MIN', 'ATL', 'TOR', 'OAK', 'BAL', 'CWS', 'MIL', 'CIN'}

Incorrect predictions using Logistic Regression (false positives) 7:
{'STL', 'WSN', 'MIN', 'BAL', 'CWS', 'MIL', 'CIN'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 0:
set()


--------
1993
--------
Actual playoff qualifiers in 1993:
{'TOR', 'CWS', 'PHI', 'ATL'}
Predicted playoff qualifiers using Logistic Regression in 1993:
{'WSN', 'ATL', 'TOR', 'HOU', 'PHI', 'SFG', 'CWS'}

Incorrect predictions using Logistic Regression (false positives) 3:
{'SFG', 'HOU', 'WSN'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 0:
set()


--------
1995
--------
Actual playoff qualifiers in 1995:
{'CLE', 'BOS', 'ATL', 'SEA', 'COL', 'CIN', 'NYY', 'LAD'}
Predicted playoff qualifiers using Logistic Regression in 1995:
{'CLE', 'CIN', 'ATL'}

Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 5:
{'BOS', 'SEA', 'COL', 'NYY', 'LAD'}


--------
1996
--------
Actual playoff qualifiers in 1996:
{'STL', 'CLE', 'ATL', 'BAL', 'SDP', 'NYY', 'LAD', 'TEX'}
Predicted playoff qualifiers using Logistic Regression in 1996:
{'CLE', 'SDP', 'ATL'}

Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 5:
{'STL', 'BAL', 'NYY', 'LAD', 'TEX'}


--------
1997
--------
Actual playoff qualifiers in 1997:
{'CLE', 'ATL', 'SEA', 'HOU', 'BAL', 'SFG', 'MIA', 'NYY'}
Predicted playoff qualifiers using Logistic Regression in 1997:
{'ATL', 'HOU', 'MIA', 'NYY', 'LAD'}

Incorrect predictions using Logistic Regression (false positives) 1:
{'LAD'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 4:
{'CLE', 'BAL', 'SFG', 'SEA'}


--------
1998
--------
Actual playoff qualifiers in 1998:
{'CLE', 'BOS', 'ATL', 'HOU', 'CHC', 'SDP', 'NYY', 'TEX'}
Predicted playoff qualifiers using Logistic Regression in 1998:
{'SDP', 'HOU', 'NYY', 'ATL'}

Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 4:
{'CLE', 'BOS', 'CHC', 'TEX'}


--------
1999
--------
Actual playoff qualifiers in 1999:
{'CLE', 'BOS', 'ATL', 'HOU', 'NYY', 'NYM', 'ARI', 'TEX'}
Predicted playoff qualifiers using Logistic Regression in 1999:
{'BOS', 'ATL', 'HOU', 'NYY', 'NYM', 'CIN', 'ARI'}

Incorrect predictions using Logistic Regression (false positives) 1:
{'CIN'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 2:
{'CLE', 'TEX'}


--------
2000
--------
Actual playoff qualifiers in 2000:
{'STL', 'ATL', 'SEA', 'OAK', 'SFG', 'CWS', 'NYM', 'NYY'}
Predicted playoff qualifiers using Logistic Regression in 2000:
{'SFG', 'ATL'}

Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 6:
{'STL', 'SEA', 'OAK', 'CWS', 'NYM', 'NYY'}


--------
2001
--------
Actual playoff qualifiers in 2001:
{'STL', 'CLE', 'ATL', 'SEA', 'HOU', 'OAK', 'NYY', 'ARI'}
Predicted playoff qualifiers using Logistic Regression in 2001:
{'OAK', 'ARI', 'SEA', 'NYY'}

Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 4:
{'STL', 'CLE', 'HOU', 'ATL'}


--------
2002
--------
Actual playoff qualifiers in 2002:
{'STL', 'MIN', 'LAA', 'ATL', 'OAK', 'NYY', 'SFG', 'ARI'}
Predicted playoff qualifiers using Logistic Regression in 2002:
{'STL', 'BOS', 'LAA', 'ATL', 'SEA', 'OAK', 'NYY', 'SFG', 'ARI'}

Incorrect predictions using Logistic Regression (false positives) 2:
{'BOS', 'SEA'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 1:
{'MIN'}


--------
2003
--------
Actual playoff qualifiers in 2003:
{'BOS', 'MIN', 'ATL', 'OAK', 'SFG', 'CHC', 'MIA', 'NYY'}
Predicted playoff qualifiers using Logistic Regression in 2003:
{'SFG', 'OAK', 'NYY', 'SEA'}

Incorrect predictions using Logistic Regression (false positives) 1:
{'SEA'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 5:
{'BOS', 'MIN', 'ATL', 'CHC', 'MIA'}


--------
2004
--------
Actual playoff qualifiers in 2004:
{'STL', 'BOS', 'MIN', 'LAA', 'ATL', 'HOU', 'NYY', 'LAD'}
Predicted playoff qualifiers using Logistic Regression in 2004:
{'STL', 'ATL'}

Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 6:
{'BOS', 'MIN', 'LAA', 'HOU', 'NYY', 'LAD'}


--------
2005
--------
Actual playoff qualifiers in 2005:
{'STL', 'BOS', 'LAA', 'ATL', 'HOU', 'CWS', 'SDP', 'NYY'}
Predicted playoff qualifiers using Logistic Regression in 2005:
{'STL', 'CLE', 'LAA', 'HOU', 'CWS'}

Incorrect predictions using Logistic Regression (false positives) 1:
{'CLE'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 4:
{'SDP', 'BOS', 'NYY', 'ATL'}


--------
2006
--------
Actual playoff qualifiers in 2006:
{'DET', 'STL', 'MIN', 'OAK', 'NYM', 'SDP', 'NYY', 'LAD'}
Predicted playoff qualifiers using Logistic Regression in 2006:
{'NYM', 'NYY'}

Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 6:
{'DET', 'STL', 'MIN', 'OAK', 'SDP', 'LAD'}


--------
2007
--------
Actual playoff qualifiers in 2007:
{'CLE', 'BOS', 'LAA', 'COL', 'PHI', 'NYY', 'CHC', 'ARI'}
Predicted playoff qualifiers using Logistic Regression in 2007:
{'BOS', 'NYY', 'NYM'}

Incorrect predictions using Logistic Regression (false positives) 1:
{'NYM'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 6:
{'CLE', 'LAA', 'COL', 'PHI', 'CHC', 'ARI'}


--------
2008
--------
Actual playoff qualifiers in 2008:
{'BOS', 'LAA', 'PHI', 'TBR', 'CWS', 'CHC', 'MIL', 'LAD'}
Predicted playoff qualifiers using Logistic Regression in 2008:
{'BOS', 'TOR', 'PHI', 'TBR', 'NYM', 'CHC', 'LAD'}

Incorrect predictions using Logistic Regression (false positives) 2:
{'TOR', 'NYM'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 3:
{'CWS', 'LAA', 'MIL'}


--------
2009
--------
Actual playoff qualifiers in 2009:
{'STL', 'BOS', 'MIN', 'LAA', 'COL', 'PHI', 'NYY', 'LAD'}
Predicted playoff qualifiers using Logistic Regression in 2009:
{'BOS', 'NYY', 'LAD', 'ATL'}

Incorrect predictions using Logistic Regression (false positives) 1:
{'ATL'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 5:
{'STL', 'MIN', 'LAA', 'COL', 'PHI'}


--------
2010
--------
Actual playoff qualifiers in 2010:
{'MIN', 'ATL', 'PHI', 'TBR', 'SFG', 'CIN', 'NYY', 'TEX'}
Predicted playoff qualifiers using Logistic Regression in 2010:
{'STL', 'ATL', 'PHI', 'TBR', 'SDP', 'NYY'}

Incorrect predictions using Logistic Regression (false positives) 2:
{'STL', 'SDP'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 4:
{'SFG', 'CIN', 'MIN', 'TEX'}


--------
2011
--------
Actual playoff qualifiers in 2011:
{'DET', 'STL', 'PHI', 'NYY', 'TBR', 'MIL', 'ARI', 'TEX'}
Predicted playoff qualifiers using Logistic Regression in 2011:
{'TBR', 'NYY', 'PHI', 'TEX'}

Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 4:
{'DET', 'STL', 'MIL', 'ARI'}


--------
2012
--------
Actual playoff qualifiers in 2012:
{'DET', 'STL', 'WSN', 'ATL', 'OAK', 'BAL', 'SFG', 'CIN', 'NYY', 'TEX'}
Predicted playoff qualifiers using Logistic Regression in 2012:
{'TBR', 'WSN', 'NYY', 'ATL'}

Incorrect predictions using Logistic Regression (false positives) 1:
{'TBR'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 7:
{'DET', 'STL', 'OAK', 'BAL', 'SFG', 'CIN', 'TEX'}


--------
2013
--------
Actual playoff qualifiers in 2013:
{'DET', 'CLE', 'PIT', 'BOS', 'STL', 'ATL', 'OAK', 'TBR', 'CIN', 'LAD'}
Predicted playoff qualifiers using Logistic Regression in 2013:
{'BOS', 'ATL', 'OAK', 'CIN', 'TEX'}

Incorrect predictions using Logistic Regression (false positives) 1:
{'TEX'}
Incorrect exclusions from prediction using Logistic Regression (false negatives) 6:
{'DET', 'CLE', 'PIT', 'STL', 'TBR', 'LAD'}


--------
2014
--------
Actual playoff qualifiers in 2014:
{'DET', 'PIT', 'STL', 'WSN', 'LAA', 'OAK', 'BAL', 'SFG', 'LAD', 'KCR'}
Predicted playoff qualifiers using Logistic Regression in 2014:
{'OAK', 'LAD', 'WSN'}

Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 7:
{'DET', 'PIT', 'STL', 'LAA', 'BAL', 'SFG', 'KCR'}


--------
2015
--------
Actual playoff qualifiers in 2015:
{'PIT', 'STL', 'TOR', 'HOU', 'NYM', 'CHC', 'TEX', 'NYY', 'LAD', 'KCR'}
Predicted playoff qualifiers using Logistic Regression in 2015:
{'STL', 'TOR'}

Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 8:
{'PIT', 'HOU', 'KCR', 'NYM', 'CHC', 'NYY', 'LAD', 'TEX'}


--------
2016
--------
Actual playoff qualifiers in 2016:
{'CLE', 'BOS', 'WSN', 'TOR', 'BAL', 'SFG', 'NYM', 'CHC', 'LAD', 'TEX'}
Predicted playoff qualifiers using Logistic Regression in 2016:
{'WSN', 'CHC'}

Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 8:
{'CLE', 'BOS', 'TOR', 'BAL', 'SFG', 'NYM', 'LAD', 'TEX'}


--------
2017
--------
Actual playoff qualifiers in 2017:
{'CLE', 'BOS', 'MIN', 'WSN', 'COL', 'HOU', 'NYY', 'CHC', 'ARI', 'LAD'}
Predicted playoff qualifiers using Logistic Regression in 2017:
{'CLE', 'ARI', 'LAD', 'NYY'}

Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 6:
{'BOS', 'MIN', 'WSN', 'COL', 'HOU', 'CHC'}


--------
2018
--------
Actual playoff qualifiers in 2018:
{'CLE', 'BOS', 'ATL', 'COL', 'HOU', 'OAK', 'CHC', 'MIL', 'NYY', 'LAD'}
Predicted playoff qualifiers using Logistic Regression in 2018:
{'CLE', 'HOU', 'BOS', 'LAD'}

Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 6:
{'ATL', 'COL', 'OAK', 'CHC', 'MIL', 'NYY'}


--------
2019
--------
Actual playoff qualifiers in 2019:
{'STL', 'WSN', 'MIN', 'ATL', 'HOU', 'OAK', 'TBR', 'MIL', 'NYY', 'LAD'}
Predicted playoff qualifiers using Logistic Regression in 2019:
{'HOU', 'LAD'}

Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 8:
{'STL', 'WSN', 'MIN', 'ATL', 'OAK', 'TBR', 'MIL', 'NYY'}


--------
2021
--------
Actual playoff qualifiers in 2021:
{'STL', 'BOS', 'ATL', 'HOU', 'TBR', 'SFG', 'CWS', 'MIL', 'NYY', 'LAD'}
Predicted playoff qualifiers using Logistic Regression in 2021:
{'SFG', 'LAD'}

Incorrect predictions using Logistic Regression (false positives) 0:
set()
Incorrect exclusions from prediction using Logistic Regression (false negatives) 8:
{'STL', 'BOS', 'ATL', 'HOU', 'TBR', 'CWS', 'MIL', 'NYY'}


Predict playoff qualifiers by "Random Forest" for a given year (1990-2021), except for 1994, 2020 season.¶

In [32]:
import warnings
warnings.filterwarnings("ignore")

# predict playoff qualifiers for a given year (1990-2021), except for 1994, 2020 season.
for year in range(1990, 2022):
    if year != 1994 and year != 2020:
        _df = df[df["yearID"] == year].copy()
        X_year = _df[features]

        # Predict using the trained XGBoosting model.
        predicted_playoff_qualifiers_rfc = rfc.predict(X_year)

        # Add the predicted_playoff_qualifier column to the copied DataFrame
        _df["predicted_playoff_qualifier_rfc"] = predicted_playoff_qualifiers_rfc

        print("--------")
        print(year)
        print("--------")
        print("Actual playoff qualifiers in " + str(year) + ":")
        actual = set(_df[_df["make_playoffs_True"] == 1]["franchID"])
        print(actual)
        print("Predicted playoff qualifiers using Random Forest in " + str(year) + ":")
        predicted_rfc = set(_df[_df["predicted_playoff_qualifier_rfc"] == 1]["franchID"])
        print(predicted_rfc)
        print()
        incorrect_rfc = predicted_rfc.difference(actual)
        print("Incorrect predictions using Random Forest (false positives) " + str(len(incorrect_rfc)) + ":")
        print(incorrect_rfc)
        exclusions_rfc = actual.difference(predicted_rfc)
        print("Incorrect exclusions from prediction using Random Forest (false negatives) " + str(len(exclusions_rfc)) + ":")
        print(exclusions_rfc)
        print()
        print()
--------
1990
--------
Actual playoff qualifiers in 1990:
{'PIT', 'BOS', 'OAK', 'CIN'}
Predicted playoff qualifiers using Random Forest in 1990:
{'PIT', 'BOS', 'OAK', 'CIN'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
1991
--------
Actual playoff qualifiers in 1991:
{'PIT', 'TOR', 'MIN', 'ATL'}
Predicted playoff qualifiers using Random Forest in 1991:
{'PIT', 'TOR', 'MIN', 'ATL'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
1992
--------
Actual playoff qualifiers in 1992:
{'PIT', 'TOR', 'OAK', 'ATL'}
Predicted playoff qualifiers using Random Forest in 1992:
{'TOR', 'OAK', 'ATL'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 1:
{'PIT'}


--------
1993
--------
Actual playoff qualifiers in 1993:
{'TOR', 'CWS', 'PHI', 'ATL'}
Predicted playoff qualifiers using Random Forest in 1993:
{'DET', 'ATL', 'TOR', 'PHI', 'SFG', 'CWS'}

Incorrect predictions using Random Forest (false positives) 2:
{'DET', 'SFG'}
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
1995
--------
Actual playoff qualifiers in 1995:
{'CLE', 'BOS', 'ATL', 'SEA', 'COL', 'CIN', 'NYY', 'LAD'}
Predicted playoff qualifiers using Random Forest in 1995:
{'CLE', 'BOS', 'ATL', 'SEA', 'COL', 'CIN', 'NYY', 'LAD'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
1996
--------
Actual playoff qualifiers in 1996:
{'STL', 'CLE', 'ATL', 'BAL', 'SDP', 'NYY', 'LAD', 'TEX'}
Predicted playoff qualifiers using Random Forest in 1996:
{'STL', 'CLE', 'ATL', 'BAL', 'SDP', 'NYY', 'LAD', 'TEX'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
1997
--------
Actual playoff qualifiers in 1997:
{'CLE', 'ATL', 'SEA', 'HOU', 'BAL', 'SFG', 'MIA', 'NYY'}
Predicted playoff qualifiers using Random Forest in 1997:
{'CLE', 'ATL', 'SEA', 'HOU', 'BAL', 'SFG', 'MIA', 'NYY'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
1998
--------
Actual playoff qualifiers in 1998:
{'CLE', 'BOS', 'ATL', 'HOU', 'CHC', 'SDP', 'NYY', 'TEX'}
Predicted playoff qualifiers using Random Forest in 1998:
{'CLE', 'BOS', 'ATL', 'HOU', 'CHC', 'SDP', 'NYY', 'TEX'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
1999
--------
Actual playoff qualifiers in 1999:
{'CLE', 'BOS', 'ATL', 'HOU', 'NYY', 'NYM', 'ARI', 'TEX'}
Predicted playoff qualifiers using Random Forest in 1999:
{'CLE', 'BOS', 'ATL', 'HOU', 'NYY', 'NYM', 'ARI', 'TEX'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
2000
--------
Actual playoff qualifiers in 2000:
{'STL', 'ATL', 'SEA', 'OAK', 'SFG', 'CWS', 'NYM', 'NYY'}
Predicted playoff qualifiers using Random Forest in 2000:
{'STL', 'ATL', 'SEA', 'OAK', 'SFG', 'CWS', 'NYM', 'NYY'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
2001
--------
Actual playoff qualifiers in 2001:
{'STL', 'CLE', 'ATL', 'SEA', 'HOU', 'OAK', 'NYY', 'ARI'}
Predicted playoff qualifiers using Random Forest in 2001:
{'STL', 'CLE', 'ATL', 'SEA', 'HOU', 'OAK', 'NYY', 'ARI'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
2002
--------
Actual playoff qualifiers in 2002:
{'STL', 'MIN', 'LAA', 'ATL', 'OAK', 'NYY', 'SFG', 'ARI'}
Predicted playoff qualifiers using Random Forest in 2002:
{'STL', 'MIN', 'LAA', 'ATL', 'OAK', 'NYY', 'SFG', 'ARI'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
2003
--------
Actual playoff qualifiers in 2003:
{'BOS', 'MIN', 'ATL', 'OAK', 'SFG', 'CHC', 'MIA', 'NYY'}
Predicted playoff qualifiers using Random Forest in 2003:
{'STL', 'BOS', 'MIN', 'ATL', 'OAK', 'SFG', 'CHC', 'MIA', 'NYY'}

Incorrect predictions using Random Forest (false positives) 1:
{'STL'}
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
2004
--------
Actual playoff qualifiers in 2004:
{'STL', 'BOS', 'MIN', 'LAA', 'ATL', 'HOU', 'NYY', 'LAD'}
Predicted playoff qualifiers using Random Forest in 2004:
{'STL', 'BOS', 'MIN', 'LAA', 'ATL', 'HOU', 'NYY', 'LAD'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
2005
--------
Actual playoff qualifiers in 2005:
{'STL', 'BOS', 'LAA', 'ATL', 'HOU', 'CWS', 'SDP', 'NYY'}
Predicted playoff qualifiers using Random Forest in 2005:
{'STL', 'BOS', 'ATL', 'HOU', 'CWS', 'SDP', 'NYY'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 1:
{'LAA'}


--------
2006
--------
Actual playoff qualifiers in 2006:
{'DET', 'STL', 'MIN', 'OAK', 'NYM', 'SDP', 'NYY', 'LAD'}
Predicted playoff qualifiers using Random Forest in 2006:
{'DET', 'STL', 'MIN', 'OAK', 'NYM', 'SDP', 'NYY', 'LAD'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
2007
--------
Actual playoff qualifiers in 2007:
{'CLE', 'BOS', 'LAA', 'COL', 'PHI', 'NYY', 'CHC', 'ARI'}
Predicted playoff qualifiers using Random Forest in 2007:
{'CLE', 'BOS', 'LAA', 'COL', 'PHI', 'NYY', 'CHC', 'ARI'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
2008
--------
Actual playoff qualifiers in 2008:
{'BOS', 'LAA', 'PHI', 'TBR', 'CWS', 'CHC', 'MIL', 'LAD'}
Predicted playoff qualifiers using Random Forest in 2008:
{'BOS', 'LAA', 'PHI', 'TBR', 'CWS', 'NYM', 'CHC', 'MIL'}

Incorrect predictions using Random Forest (false positives) 1:
{'NYM'}
Incorrect exclusions from prediction using Random Forest (false negatives) 1:
{'LAD'}


--------
2009
--------
Actual playoff qualifiers in 2009:
{'STL', 'BOS', 'MIN', 'LAA', 'COL', 'PHI', 'NYY', 'LAD'}
Predicted playoff qualifiers using Random Forest in 2009:
{'STL', 'BOS', 'MIN', 'LAA', 'COL', 'PHI', 'NYY', 'LAD'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
2010
--------
Actual playoff qualifiers in 2010:
{'MIN', 'ATL', 'PHI', 'TBR', 'SFG', 'CIN', 'NYY', 'TEX'}
Predicted playoff qualifiers using Random Forest in 2010:
{'MIN', 'ATL', 'PHI', 'TBR', 'SFG', 'CIN', 'NYY', 'TEX'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
2011
--------
Actual playoff qualifiers in 2011:
{'DET', 'STL', 'PHI', 'NYY', 'TBR', 'MIL', 'ARI', 'TEX'}
Predicted playoff qualifiers using Random Forest in 2011:
{'DET', 'STL', 'PHI', 'NYY', 'TBR', 'MIL', 'ARI', 'TEX'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
2012
--------
Actual playoff qualifiers in 2012:
{'DET', 'STL', 'WSN', 'ATL', 'OAK', 'BAL', 'SFG', 'CIN', 'NYY', 'TEX'}
Predicted playoff qualifiers using Random Forest in 2012:
{'DET', 'STL', 'WSN', 'ATL', 'OAK', 'BAL', 'SFG', 'CIN', 'NYY', 'TEX'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
2013
--------
Actual playoff qualifiers in 2013:
{'DET', 'CLE', 'PIT', 'BOS', 'STL', 'ATL', 'OAK', 'TBR', 'CIN', 'LAD'}
Predicted playoff qualifiers using Random Forest in 2013:
{'DET', 'CLE', 'PIT', 'BOS', 'STL', 'ATL', 'OAK', 'TBR', 'CIN', 'LAD'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
2014
--------
Actual playoff qualifiers in 2014:
{'DET', 'PIT', 'STL', 'WSN', 'LAA', 'OAK', 'BAL', 'SFG', 'LAD', 'KCR'}
Predicted playoff qualifiers using Random Forest in 2014:
{'DET', 'PIT', 'STL', 'WSN', 'LAA', 'OAK', 'BAL', 'SFG', 'LAD', 'KCR'}

Incorrect predictions using Random Forest (false positives) 0:
set()
Incorrect exclusions from prediction using Random Forest (false negatives) 0:
set()


--------
2015
--------
Actual playoff qualifiers in 2015:
{'PIT', 'STL', 'TOR', 'HOU', 'NYM', 'CHC', 'TEX', 'NYY', 'LAD', 'KCR'}
Predicted playoff qualifiers using Random Forest in 2015:
{'PIT', 'STL', 'WSN', 'TOR', 'HOU', 'SFG', 'NYM', 'CHC', 'LAD'}

Incorrect predictions using Random Forest (false positives) 2:
{'SFG', 'WSN'}
Incorrect exclusions from prediction using Random Forest (false negatives) 3:
{'KCR', 'NYY', 'TEX'}


--------
2016
--------
Actual playoff qualifiers in 2016:
{'CLE', 'BOS', 'WSN', 'TOR', 'BAL', 'SFG', 'NYM', 'CHC', 'LAD', 'TEX'}
Predicted playoff qualifiers using Random Forest in 2016:
{'CLE', 'BOS', 'WSN', 'SEA', 'TOR', 'SFG', 'CHC', 'LAD'}

Incorrect predictions using Random Forest (false positives) 1:
{'SEA'}
Incorrect exclusions from prediction using Random Forest (false negatives) 3:
{'NYM', 'TEX', 'BAL'}


--------
2017
--------
Actual playoff qualifiers in 2017:
{'CLE', 'BOS', 'MIN', 'WSN', 'COL', 'HOU', 'NYY', 'CHC', 'ARI', 'LAD'}
Predicted playoff qualifiers using Random Forest in 2017:
{'STL', 'CLE', 'BOS', 'WSN', 'HOU', 'NYY', 'CHC', 'ARI', 'LAD'}

Incorrect predictions using Random Forest (false positives) 1:
{'STL'}
Incorrect exclusions from prediction using Random Forest (false negatives) 2:
{'COL', 'MIN'}


--------
2018
--------
Actual playoff qualifiers in 2018:
{'CLE', 'BOS', 'ATL', 'COL', 'HOU', 'OAK', 'CHC', 'MIL', 'NYY', 'LAD'}
Predicted playoff qualifiers using Random Forest in 2018:
{'CLE', 'BOS', 'WSN', 'ATL', 'HOU', 'OAK', 'TBR', 'CHC', 'MIL', 'NYY', 'LAD'}

Incorrect predictions using Random Forest (false positives) 2:
{'TBR', 'WSN'}
Incorrect exclusions from prediction using Random Forest (false negatives) 1:
{'COL'}


--------
2019
--------
Actual playoff qualifiers in 2019:
{'STL', 'WSN', 'MIN', 'ATL', 'HOU', 'OAK', 'TBR', 'MIL', 'NYY', 'LAD'}
Predicted playoff qualifiers using Random Forest in 2019:
{'STL', 'CLE', 'WSN', 'MIN', 'ATL', 'HOU', 'OAK', 'TBR', 'NYY', 'LAD'}

Incorrect predictions using Random Forest (false positives) 1:
{'CLE'}
Incorrect exclusions from prediction using Random Forest (false negatives) 1:
{'MIL'}


--------
2021
--------
Actual playoff qualifiers in 2021:
{'STL', 'BOS', 'ATL', 'HOU', 'TBR', 'SFG', 'CWS', 'MIL', 'NYY', 'LAD'}
Predicted playoff qualifiers using Random Forest in 2021:
{'ATL', 'TOR', 'HOU', 'OAK', 'TBR', 'SFG', 'CWS', 'MIL', 'NYY', 'LAD'}

Incorrect predictions using Random Forest (false positives) 2:
{'TOR', 'OAK'}
Incorrect exclusions from prediction using Random Forest (false negatives) 2:
{'STL', 'BOS'}


Conclusion¶

In conclusion, the analysis reveals interesting insights into the performance of different predictive models and key factors influencing playoff success in baseball. Random Forest emerges as the most accurate model, showcasing its robustness in capturing complex relationships within the data. Conversely, SVM demonstrates lower accuracy, indicating potential limitations in handling certain data patterns. The correlation of P% with playoff success underscores the importance of offensive prowess in determining a team's postseason trajectory. On the pitching front, metrics such as EPA, WHIP, and BAA exhibit a negative impact, suggesting that pitching performance alone may not suffice for playoff advancement.¶

However, it's crucial to acknowledge the limitations inherent in this study. The small sample size could introduce bias and limit the generalizability of the findings. Additionally, the difficulty in comparing teams individually in the playoffs poses a significant challenge for making precise predictions about World Series contenders and eventual champions.¶

Looking ahead, future research directions should prioritize expanding the dataset to include more seasons and teams, enabling a more comprehensive analysis. Exploring advanced statistical techniques and incorporating additional variables could further enhance predictive accuracy. Moreover, developing methodologies to effectively compare teams in the playoff context would be instrumental in refining prediction models and providing actionable insights for stakeholders in the baseball community. By addressing these limitations and advancing research methodologies, we can strive towards a more nuanced understanding of playoff dynamics and improve our ability to forecast World Series outcomes with greater confidence.¶